A Statistical Approach to Automatic OCR Error Correction in Context

Authors

Xiang Tong

David A. Evans

Abstract

This paper describes an automatic, context-sensitive, word-error correction system based on statistical language modeling (SLM) as applied to optical character recognition (OCR) postprocessing. The system exploits information from multiple sources, including letter n-grams, character confusion probabilities, and word-bigram probabilities. Letter n-grams are used to index the words in the lexicon. Given a sentence to be corrected, the system decomposes each string in the sentence into letter n-grams and retrieves word candidates from the lexicon by comparing string n-grams with lexicon-entry n-grams. The retrieved candidates are ranked by the conditional probability of matches with the string, given character confusion probabilities. Finally, the word-bigram model and the Viterbi algorithm are used to determine the best-scoring word sequence for the sentence. The system can correct non-word errors as well as real-word errors and achieves a 60.2% error reduction rate for real OCR text. In addition, the system can learn the character confusion probabilities for a specific OCR environment and use them in self-calibration to achieve better performance.
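The pipeline outlined in the abstract (n-gram indexing, candidate retrieval, confusion-based ranking, Viterbi decoding over word bigrams) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the trigram size, the "#" padding, the position-by-position substitution model (which ignores insertions and deletions), and every probability value below are assumptions chosen only to make the example runnable.

```python
import math
from collections import defaultdict

def letter_ngrams(word, n=3):
    """Set of letter n-grams of a word, padded with '#' boundary marks (assumed)."""
    padded = "#" * (n - 1) + word + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_index(lexicon, n=3):
    """Map each letter n-gram to the lexicon words that contain it."""
    index = defaultdict(set)
    for word in lexicon:
        for gram in letter_ngrams(word, n):
            index[gram].add(word)
    return index

def retrieve(token, index, n=3, top_k=5):
    """Return up to top_k lexicon words sharing the most n-grams with token."""
    counts = defaultdict(int)
    for gram in letter_ngrams(token, n):
        for word in index.get(gram, ()):
            counts[word] += 1
    return sorted(counts, key=lambda w: (-counts[w], w))[:top_k]

def channel_logprob(observed, candidate, confusion, p_same=0.9, p_other=0.001):
    """log P(observed | candidate) under a toy substitution-only model.

    confusion maps (true_char, observed_char) to a probability; unseen pairs
    and length mismatches fall back to the assumed floor p_other.
    """
    if len(observed) != len(candidate):
        return math.log(p_other) * max(len(observed), len(candidate))
    logp = 0.0
    for obs_ch, true_ch in zip(observed, candidate):
        if obs_ch == true_ch:
            logp += math.log(p_same)
        else:
            logp += math.log(confusion.get((true_ch, obs_ch), p_other))
    return logp

def viterbi_correct(tokens, index, confusion, bigram, p_unseen=1e-4):
    """Pick the word sequence maximizing channel score plus bigram score."""
    lattice = []
    for tok in tokens:
        cands = retrieve(tok, index) or [tok]  # keep the token if nothing matches
        lattice.append([(c, channel_logprob(tok, c, confusion)) for c in cands])
    # best[word] = (log score, path) of the best path ending in that word
    best = {w: (math.log(bigram.get(("<s>", w), p_unseen)) + ch, [w])
            for w, ch in lattice[0]}
    for column in lattice[1:]:
        new_best = {}
        for w, ch in column:
            score, path = max(
                (s + math.log(bigram.get((prev, w), p_unseen)), p)
                for prev, (s, p) in best.items()
            )
            new_best[w] = (score + ch, path + [w])
        best = new_best
    return max(best.values())[1]

# Toy data: every value here is invented for illustration.
lexicon = ["the", "cat", "eat", "sat", "mat"]
index = build_index(lexicon)
confusion = {("h", "b"): 0.05, ("c", "e"): 0.05}  # (true, observed) pairs
bigram = {("<s>", "the"): 0.5, ("the", "cat"): 0.3, ("cat", "sat"): 0.4}

corrected = viterbi_correct(["tbe", "eat", "sat"], index, confusion, bigram)
print(corrected)
```

In this toy input, "tbe" is a non-word error and "eat" is a real-word error: "eat" is in the lexicon, so only the bigram context can favor "cat". With the invented numbers above, the decoder should return "the cat sat". A realistic system would estimate the confusion and bigram tables from data rather than hard-coding them.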

Similar resources

OCR Error Correction Using Statistical Machine Translation

In this paper, we explore the use of a statistical machine translation system for optical character recognition (OCR) error correction. We investigate the use of word- and character-level models to support a translation from OCR system output to correct French text. Our experiments show that character- and word-based machine translation correction makes significant improvements to the quality of t...

Diploma Thesis: Unsupervised Post-Correction of OCR Errors

The trend to digitize (historic) paper-based archives has emerged in recent years. The advantages of digital archives are easy access, searchability, and machine readability. These advantages can only be ensured if few or no OCR errors are present. These errors are the result of misrecognized characters during the OCR process. Large archives make it unreasonable to correct errors manually. The...

Educational Context and ELT Teachers’ Corrective Feedback Preference: Public and Private School Teachers in Focus

This study investigated the possible relationship between educational context and English Language Teaching (ELT) teachers’ corrective feedback preference. To this end, 42 Iranian EFL teachers from some private language institutes and 39 Iranian EFL teachers from different schools in Shiraz, Iran participated in the study. The Questionnaire for Corrective Feedback Approaches (QCFAs) was ...

OCR Context-Sensitive Error Correction Based on Google Web 1T 5-Gram Data Set

Since the dawn of the computing era, information has been represented digitally so that it can be processed by electronic computers. Paper books and documents were abundant and widely published at that time; hence, there was a need to convert them into digital format. OCR, short for Optical Character Recognition, was conceived to translate paper-based books into digital e-books. Regret...

Low-resource OCR error detection and correction in French Clinical Texts

In this paper we present a simple yet effective approach to automatic OCR error detection and correction on a corpus of French clinical reports of variable OCR quality within the domain of foetopathology. While traditional OCR error detection and correction systems rely heavily on external information such as domain-specific lexicons, OCR process information or manually corrected training mater...

Publication date: 1996
